Data mining for terrorists
At the weekend Ben Goldacre wrote an article about the use (or uselessness) of data mining for national security. It is on his blog here. He attacks the idea of data mining for terrorists by considering the number of false positives that would be produced given certain assumptions about the sensitivity and specificity of the test. I am a great fan of Ben's but he is overly fond of using medical statistics models in situations where they don't easily fit.
Data mining can mean many things and can be used to address many different problems. Ben's article addresses just one scenario: studying patterns to identify potential terrorists in the UK. You might call this the credit card scenario. It is a technique that works well for detecting credit card fraud and for all sorts of good reasons is unlikely to work for detecting terrorists. Ben links to an article by Bruce Schneier that explains why. It is true that this is very likely to produce an overwhelming number of false positives but I can't believe that the people working on these things haven't realized that. They don't really have to do anything very sophisticated. They just have to ask themselves – how many people in the UK are going to match this pattern anyway?
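To make the size of the false positive problem concrete, here is a rough sketch of the arithmetic. Every number in it is an assumption chosen for illustration (a UK population of 60 million, 1,000 real targets, and a test with 99% sensitivity and 99% specificity); none of them comes from Ben's article or from any real system, and the function name is hypothetical.

```python
def screening_outcome(population, targets, sensitivity, specificity):
    """Expected results of screening a whole population with an imperfect test.

    Returns (true_positives, false_positives, ppv), where ppv is the
    fraction of flagged people who really are targets.
    """
    innocents = population - targets
    true_positives = targets * sensitivity
    false_positives = innocents * (1.0 - specificity)
    flagged = true_positives + false_positives
    ppv = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, ppv


# Illustrative assumptions only, not real figures.
tp, fp, ppv = screening_outcome(
    population=60_000_000,
    targets=1_000,
    sensitivity=0.99,
    specificity=0.99,
)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"share of flagged people who are real targets: {ppv:.4%}")
```

Under these assumptions roughly 600,000 innocent people are flagged alongside about 990 real targets, which is the point Ben and Schneier are making. The same function also answers the simpler question above: the number of people who match the pattern anyway is just the flagged total.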
Ben also links to an online book by the National Academies Press which identifies two types of data mining (there are others): subject-based data mining and pattern-based data mining. Subject-based data mining is little more than the speeding up of normal methods of investigation. There is an incident or individual or group or potential target and the security forces need to investigate a wide variety of links. There is little serious doubt about the value of this method. It is just an extension of what the police national computer is already used for. It seems very plausible that the security forces would be able to do this even more effectively, and in a wider variety of situations, if they had more information about the UK population online. Considerations of specificity and sensitivity don't come into it.
Pattern-based data mining is closer to the credit card scenario. If it is used crudely, in isolation from other sources of information, to discover potential terrorists in the whole UK population, then Ben's calculation becomes relevant and the approach seems wildly implausible. But several things could make it a plausibly useful tool. Perhaps the two most important are:
The problem to be addressed might be different.
Security forces may be trying to decide whether to raise the national security alert level because there are signs of terrorist activity (although we don't know who the terrorists are).
Other information may change the situation hugely.
To see how other information can change things, go back to the credit card fraud scenario. We know this works but it does create a large number of false positives. I suspect most credit card owners have had a call at some time or another because their pattern of spending has been unusual. But imagine the police believed that someone in Leeds had recently been using stolen credit cards to buy high value electrical goods for resale. Now the potential of pattern matching would be increased enormously. Something similar would apply to the security forces scenario.
In the end this is a matter of whether it is worth the financial and privacy costs and that is a very difficult question when the benefits necessarily have to be described rather vaguely. But I don't see that the screening for cancer model adds much to understanding the benefits.
2 Comments:
It is hard to think of an example of what you call "other information" that would not be at the core of what the original algorithm was searching for in the first place. The argument is along the same lines as the "combining 2 independent algorithms" idea: at the end of the day your method has a given level of precision yielding a certain percentage of false positives. Breaking it down into multiple stages isn't going to change the fact that 90% precision is still a very, very generous estimate.
Anonymous
I think you don't appreciate the opportunity to combine dynamic intelligence information with tried patterns from data mining.
There is an easy way to see whether there are going to be too many false positives: ask yourself how many people fit the profile.
Look at the security context as opposed to credit cards. Suppose intelligence sources tell us there is a threat from the Somali community in North London. How big is that community? Actually the answer is quite easy to find. There are about 40,000 Somalis in London, so roughly 20,000 in North London. Suddenly your population is not looking so large. Furthermore you have strong reason to suppose there is a group of terrorists among them. Now pattern matching might well make a substantial contribution to catching one of those terrorists: not all terrorists in the UK, just those who have been identified as presenting an imminent threat.
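A rough sketch of how much difference the smaller population makes. The 20,000 figure is the estimate above; everything else is an assumption for illustration: a UK population of 60 million, a group of 10 terrorists, and a pattern match with 90% sensitivity and 90% specificity (one loose reading of the "90% precision" in the comment above). The function name is hypothetical.

```python
def positive_predictive_value(population, targets, sensitivity, specificity):
    """Fraction of flagged people who are real targets (expected values)."""
    true_positives = targets * sensitivity
    false_positives = (population - targets) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Illustrative assumptions: same 10 targets, same test, different populations.
national = positive_predictive_value(60_000_000, 10, 0.9, 0.9)
narrowed = positive_predictive_value(20_000, 10, 0.9, 0.9)

print(f"whole UK:       1 real target per {1 / national:,.0f} people flagged")
print(f"narrowed group: 1 real target per {1 / narrowed:,.0f} people flagged")
```

On these assumed numbers the narrowed search still flags about 2,000 people, so it is no substitute for ordinary investigation, but the hit rate among those flagged improves by a factor of several thousand compared with screening the whole country.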